REMARKS 

Claims 1-6, 8-14, 16, 20-23, 49-52, 74-97, and 121-126 were pending, with claims 4, 
12, 16, 20-23, 49, 81-83, and 91-94 withdrawn from consideration. With this amendment, 
claims 1-2, 4, 8-9, 1 1-14, 21, 49-50, 81-83, 85, 87, 89, 91, 93, 95, 122, and 124-125 have been 
amended for clarity, claims 22-23, 121, 123, and 126 have been cancelled without prejudice 
and new claims 128-142 have been added. Thus, upon entry of the present amendments, 
claims 1-6, 8-14, 16, 20-21, 49-52, 74-97, and 122, 124-125, and 128-142 will be pending. 

Support for the amendments to claim 1 and new claims 128-142 is found in the 
following table. 



Claim 


Support in specification. 


1 . (Currently amended) A method for constructing a variant 
set for modifying an antibody of interest, the method 
comprising: 

(a) identifying a plurality of positions in said 
antibody of interest and, for each respective position in said 
plurality of positions, one or more substitutions for the 
respective position, wherein the plurality of positions and 
the one or more substitutions for each respective position in 
the plurality of positions collectively define an antibody 
sequence space; 

(b) selecting a first plurality of variants of the 
antibody of interest, thereby forming a variant set, wherein 
said variant set comprises a subset of said antibody sequence 
space; 

(c) measuring a property of all or a portion of the 

variants in said variant sot; 

^ (c) modeling ee- . using a suitably prosraniiiied 
computer, a computer a plurality of sequence-activity 


Section 5.4.2, beginning on 
page 64, line 25, outlines 
embodiments in which all 
measurements have previously 
been made 

Figure 1 shows a computer that 
can be suitably programmed to 
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Claim 



Support in specification. 



relationships, wherein 

each respective sequence-activity relationship in 
said plurality of sequence-activity relationships is between 
(i) a respective plurality of descriptors, wherein each 
descriptor in the respective plurality of descriptors is a 
descriptor for one or more corresponding substitutions at 
one or more corresponding positions of the antibody of 
interest in the variant set and (ii) a plurality of quantitative 
measures of [[the]] a propert y, wherein each quantitative 
measure in the plurality of quantitative measures is a 
measurement of the p ro perty exhibited by measured for 
each a different variant in all or said portion of the variants 
m the variant se t, and wherein 

each respective sequence-activity relationship in 
said plurality of sequence-activity relationships includes at 
least one descriptor that is found in another 
sequence-activity relationship in said plurality of 
sequence-activity relationships: 

each respective sequence-activity relationship in 
said plurality of sequence-activity relationships is either 

(A) between (i) a subset of the plurality of 
descriptors and (ii) the plurality of quantitative measures or 

(B) between (i) the plurality of descriptors 
and (ii) a subset of the plurality of quantitative measures: 

each descriptor in the plurality of descriptors is 
weighted by a corresponding weight in a plurality of 
weights, and wherein said modeling comprises determining, 
for each respective weight in the plurality of weights, a 
respective first value and a respective second value for the 
respective weight using sequence-activity relationships in 
the plurality of sequence-activity relationships that include 
the descriptor corresponding to the weight, wherein the 



run the claimed method; Figure 
4 and Figure 5 and page 60, 
lines 7-27, disclose modeling a 
plurality of sequence-activity 
relationships in which data is 
left out of each 
sequence-activity relationship 
modeling trial; Section 5.3.1 
beginning on page 62, line 10 
discloses combining the weighs 
of the plurality of 
sequence-activity relationships, 
where each of the 
sequence-activity relationships 
comprises a plurality of 
weighted descriptors, and where 
the combining provides a 
weight for each descriptor (first 
value) and a measure of 
confidence in the weight 
(second value); see also 
Equation 3 on page 50 and 
accompanying text that 
describes descriptors of a 
sequence-activity relationship 
and Equation 5 on page 5 1 
which describes linear weights 
for descriptors in a 
sequence-activity relationship 
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Claim 



Support in specification. 



respective second value is a measure of confidence in the 
respective first value 

, and then deriving from the sequence activity 
relationship (A) a first value for a contribution to the 
measured property by the one or more substitutions at one or 
more positions in the plurality of positions in the antibody of 
interest, and (B) a second value quantifidng a confidence 
with which the contribution to the measured property by the 
one or more substitutions at one or more positions of the 
antibody of interest can bo assigned; and 

(e) outputting said first value and said second value 
to a user, a display, or a tangible computer readable storage 
medium . 



This outputting step was not in 
the claim as originally filed 



128. (New) The method of claim 1, wherein the respective 
first value is an average of the respective weight from each 
sequence-activity relationship in the plurality of 
sequence-activity relationships that includes the respective 
weight. 



Page 62, line 21, through page 
63, line 13. 



129. (New) The method of claim 1, wherein the respective 
second value for a first weight in the plurality of weights is a 
standard deviation of the respective weight from each 
sequence-activity relationship in the plurality of 
sequence-activity relationships that includes the respective 
weight. 



Page 62, lines 26-30 



130. (New) The method of claim 1, the method further 
comprising: 

(d) selecting a second plurality of variants of the 
antibody of interest, wherein each variant in said second 
plurality of variants has one or more substitutions, wherein a 



Section 5.4 beginning on page 
63, line 26 and, in particular, 
page 65, lines 9-12 



20 



LAI-3040581v2 



Claim 


Support in specification. 


substitution in the one or more substitutions is characterized 
by a descriptor in the pluraUty of descriptors having a 
weight for which the modeling (c) determined a first value 
and a second value, wherein 

the first value exceeds the second value, 
the first value is an average or mean of the 
weight from each sequence-activity relationship in the 
plurality of sequence-activity relationships that includes the 
weight. 




131. (New) The method of claim 130, wherein the second 
value is a standard deviation and the first value exceeds the 
second value by at least the standard deviation. 


Section 5.4 beginning on page 
63, line 26 and, in particular, 
page 65, lines 9-12 


132. (New) The method of claim 130, wherein the second 
value is a standard deviation and the first value exceeds the 
second value by at least twice the standard deviation. 


Section 5.4 beginning on page 

C C XT C 

63, line 26 and, in particular, 
page 65, lines 9-12 


133. (New) The method of claim 130, wherein the second 
value is a standard deviation and the first value exceeds the 
second value by at least three times the standard deviation. 


Section 5.4 beginning on page 

^ G AT G 

63, line 26 and, in particular, 
page 65, lines 9-12 


134. (New) The method of claim 1, wherein the respective 
second value for a first weight in the plurality of weights is a 

variance of the respective first value for the first weight. 


Page 62, lines 26-30 


135. (New) The method of claim 1, implemented on a 
computer. 


Figure 1 shows a computer that 
can be suitably programmed to 
run the claimed method. 


136. (New) The method of claim 130, wherein a 
substitution in the one or more substitutions is found in at 
three variants in the variant set. 


Page 65, lines 13-17 
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Support in specification. 






137. (New) The method of claim 130, wherein a first value 
for a descriptor is positive. 


Page 65, lines 6-17 


138. (New) The method of claim 1, wherein a respective 
first value is a mean of the corresponding weight fi-om each 
sequence-activity relationship in the plurality of 
sequence-activity relationships that includes the respective 
weight. 


Page 62, line 21, through page 
63, line 13; page 56, lines 19-20 

where mean is defined as a form 
of average. 


139. (New) The method of claim 1, wherein a first value of 
a weight in the plurality of weights is a regression 
coefficient for the corresponding descriptor. 


Page 61, lines 31-34 


140. (New) The method of claim 1, wherein a first value of 
a weight in the plurality of weights for a sequence-activity 
relationship in the plurality of sequence-activity 
relationships is determined by a relative contribution of the 

corresponding descriptor in the plurality of descriptors to 

XT O XT JT w/ Jr 

the property of the antibody in the sequence-activity 
relationship. 


Page 61, lines 31-34 


141. (New) The method of claim 1, wherein a first value of 
a weight in the plurality of weights for a sequence-activity 
relationship in the plurality of sequence-activity 
relationships is determined by an absolute contribution of 
the corresponding descriptor in the plurality of descriptors 
to the property of the antibody in the sequence-activity 
relationship. 


Page 61, lines 31-34 


142. (New) The method of claim 1, the method further 
comprising: 


Page 60, lines 29-33 
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Support in specification. 


(d) using said modeling (c) to identify a variant of 
the antibody of interest that has a high probability of having 
an improved property relative to the antibody of interest. 





Claims 2, 4, 8-9, 11-14, 21, 49-51, 81-83, 85, 87, 89, 91, 93, 95, 122 and 124 were 
amended for clarity and to correct for antecedent basis. Accordingly, no new matter has been 
added by way of the amendments to the claims, 

THE 35 U.S.C. §§ 102 AND 103 REJECTIONS SHOULD BE WITHDRAWN 

The PTO has rejected claims 1-3, 5-11, 13-14, 50-52, 74-80, 89-90, 95-97, and 121-126 
as being unpatentable over Gustafsson et al,^ 2001, Joumal of Molecular Recognition 14, 
308-314 (hereinafter "Gustafsson 2001"), United States Patent No. 7,1 17,096 to Luo 
(hereinafter "Luo")^ and United States Patent Publication No. 20040072245 to Gustafsson et 
al,^ (hereinafter "Gustafsson 2004"). Applicants traverse the claim rejections. 

Claim 1 has been amended to specify that a plurality of sequence-activity relationships 
is computed, where each sequence-activity relationship is computed with an incomplete data 
set. In the first mode specified in claim 1 , each respective sequence-activity relationship in the 
plurality of sequence-activity relationships is computed between (i) a subset of a plurality of 
descriptors and (ii) a plurality of quantitative measures. The quantitative measures are 
measurements of a physical property of variants in the variant set. Each descriptor in the 
plurality of descriptors is for one or more possible mutations in the antibody of interest. So, in 
this mode, each respective sequence-activity relationship is computed using only a subset of 
the plurality of descriptors. In other words, for each sequence-activity relationship, some 
descriptors are left out to compute a particular sequence-activity relationship in this mode. 



On page 3 of the February 18, 2009 Office Action, the Examiner cites this references as Peizi et al. (US 
7171 7096), Applicants assume that the PTO is referring to United States Patent No. 7,11 7,096 to Luo Peizhi et aL 
Applicants respectfully request that United States Patent No, 7,1 17,096 to Luo Peizhi et aL be made of record and 
that U.S. 1\1\1{)96 to Peizi et al., which is not a reference an existing U.S. patent, be expunged from the record. 

2 

Each descriptor in the plurality of descriptors is, for example, (i) a substitution at a position in the 
plurality of positions represented by all or a portion of the variants in the variant set, (ii) a plurality of substitutions 
at a position in the plurality of positions represented by all or a portion of the variants in the variant set, or (iii) one 
or more substitutions in one or more positions in the plurality of positions represented by all or a portion of the 
variants in the variant set. See, for example, claim 8, 
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In the second mode specified by claim 1 , each respective sequence-activity relationship 
in the plurality of sequence-activity relationships is computed between (i) the plurality of 
descriptors and (ii) a subset of the plurality of quantitative measures. So, in this second mode, 
each respective sequence-activity relationship is computed using only a subset of the 
quantitative measures but all of the descriptors. In other words, for each sequence-activity 
relationship, all the descriptors are considered, but some of the variants in the variant set (more 
specifically, some of the physical measurements for some of the variants in the variant set) are 
left out. 

The omission of data when computing a plurality of sequence-activity relationships, 
either by the first mode or the second mode specified in claim 1 , is discussed, for example, on 
page 60, lines 7-17, of the specification in conjunction with Figures 4 and 5. As specified in 
Figures 4 and 5, in some instances, data is removed (either descriptors or physical 
measurements) from the data set (e.g., on a random basis) before each structure-activity 
relationship is computed. 

Claim 1 , as amended further specifies that each descriptor in the plurality of descriptors 
is weighted by a corresponding weight in a plurality of weights. Thus, a sequence-activity 
relationship in the plurality of sequence-activities of relationship can, for example, have the 
form set forth in Equation 5 on page 5 1 of the specification, reproduced below: 

(Eq. 5) Y = wixi + W2X2, + WiXi 

Specification, page 51, line 30 

In Equation 5, Y is a quantitative measure of a property of the antibody, xi is a descriptor of a 
substitution, a combination of substitutions, or a component of one or more substitutions in the 
sequence of the antibody, and wi is the weight of the corresponding descriptor xi. 

Claim 1 further specifies that the modeling comprises determining, for each respective 
weight in the plurality of weights (for each Wi in the set wi, W2, . . wi), a respective first value 
and a respective second value using sequence-activity relationships in the plurality of 
sequence-activity relationships that include the descriptor corresponding to the weight. Thus, 
for example, consider the case where four sequence-activity relationships are modeled, each 
with a subset of the data (either restricted to less than all the descriptors or less than all the 
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physical property measurements of the variants). These sequence-activity relationships may 
have the form: 

Ya = WalXi + Wa2X2, + WaiXi 
Yb WblXi + Wb2X2, + WbiXi 
Yc = WclXi + Wc2X2, + WciXi 
Yd = WdlXi + Wd2X2, + WdiXi 

For each weight wi, a respective first value and a respective second value is determined. In this 
example, a respective first value and a respective second value are computed for wi from the 
values for wi in four structure-activity relationships (wai, Wbi, Wd, and Wdi). For example, as 
specified on page 60, lines 15-16, of the specification, an average of Wai, Wbi, Wci, and Wdi may 
be obtained. This average of Wai, Wbi, Wci, and Wdi would constitute the first value for the 
weight specified in claim 1 . 

Claim 1, as amended, further specifies that second value for the weight is a measure of 
confidence in the first value. Thus, in the case where the first value for the weight is an average, 
the second value is a measure of confidence in the average. See, for example, page 60, line 22, 
of the specification where a measure of confidence of the weight function is described, as well 
as page 62, lines 26-30, where measures of confidence of the first value, such as standard 
deviation and variance of the first value, are described. 

Claim 1 provides substantial advantages over the prior art because each of the weights 
for the descriptors in the plurality of sequence-activity relationships becomes characterized by 
a first and second value. These values can be used to determine which descriptors are more 
important in influencing a physical property of the antibody under study. 

Presumably, the descriptors having weights that have large, positive first values (e.g,^ 
large averages, means, etc.) will have more influence on the physical property of the antibody 
than descriptors having weights with small first values. However, this is not necessarily the 
case. If a descriptor has a weight that has a large first value (e.g., a larger mean or average), and 
an even larger second value (where the second value is the measure of confidence in the first 
value expressed, for example as a standard deviation or mean of the first value), then the 
confidence in the first value is substantially reduced. 

The information obtained using the above-described method is exploited in some 
claimed embodiments to redefine the variant set. For example, new claim 130 specifies 
selecting a second plurality of variants of the antibody of interest, where each variant in the 
second plurality of variants has one or more substitutions, where each respective substitution in 
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the one or more substitutions is characterized by a respective descriptor having a respective 
weight for which the modeling of claim 1 determined a first value and a second value, where 
the first value exceeds the second value by a multiple and where the first value is an average of 
the respective weight from each sequence-activity relationship in the plurality of 
sequence-activity relationships that includes the respective weight. Thus, new claim 130 
requires that only descriptors having weights in which their corresponding first values exceed 
their second values by a multiple are allowed to be used in the new variant set. In other words, 
only those substitutions for which there is provided a certain requisite confidence can be used 
in the second variant set. 

Claim 131 specifies that the second value is a standard deviation and the multiple is 1. 
Claim 132 specifies that the second value is a standard deviation and the multiple is 2. Claim 
133 specifies that the second value is a standard deviation and the multiple is 3. See also page 
65, lines 9-12, of the specification where states: 

(ii) A substitution can be selected if it has a positive regression coefficient, 
weight or other value describing its relative or absolute configuration to one or 
more activity of the antibody, and it s at least one standard deviation, preferably 
two standard deviations or more preferably three standard deviations above 
neutrality. 

The computation of a first value (e.g. average) and second value (e.g. standard deviation, 
variance or some other measure of confidence) for each weight for each descriptor used in a 
variant set has been used to successful identify variants with improved activity. See for 
example, Liao et al.^ 2007, "Engineering proteinase K using machine leaming and synthetic 
genes", BMC Biotechnology 7, 16, attached hereto as Appendix A (hereinafter "Liao"). Liao 
used the criteria that all substitutions in the redesigned variant set must (i) have occurred at 
least three times in the original variant set and (ii) have a mean weight (first value) that is at 
least one standard deviation (second value) above zero. Liao found that this criteria was 
always better than using the altemative criteria that all substitutions in the redesigned variant 
set must (i) have occurred at least three times in the original variant set and (ii) have a mean 
weight (first value) that is simply above zero. See caption to Figure 4 of Liao where it is stated 
in the second to last line that "[n]ote that this third design in each group is always better than 
the second." 

^ Applicants note that Liao makes use of the protein proteinase k, which is not an antibody. Nevertheless, 
Applicants respectfully submit that the design principles that were validated in Liao are applicable to antibodies. 
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Thus, the claimed invention provides novel methods for identifying the descriptors (e.g, 
substitutions) in an antibody that should lead to an improvement in a physical property of 
antibody variants in a redesigned variant set relative to the original antibody of interest. 
Gustafsson 2001, Luo and Gustafsson 2004 do not disclose, teach or suggest the computation 
of multiple sequence-activity relationships, where each sequence-activity relationship is 
computed using an incomplete data set of descriptors or variants and where each 
sequence-activity relationship is characterized by a plurality of descriptors, where each 
descriptor has a corresponding weight. Moreover, Gustafsson 200 1 , Luo and Gustafsson 2004 
do not disclose, teach or suggest calculating a first and second value for each of the weights, 
where the second weight is a measure of confidence in the second weight. Thus, Gustafsson 
200 1 , Luo and Gustafsson 2004 do not disclose, teach or suggest what is specified by claim 1 . 

Each of the remaining rejected claims depends from claim 1 and thus is patentable over 
any combination of Gustafsson 2001, Luo and Gustafsson 2004 for at least the same reasons. 
Thus, Applicants respectfully request that the 35 U.S.C. §§ 102 and 103 rejections be 
withdrawn. 

THE OBVIOUS-TYPE DOUBLE PATENTING REJECTIONS SHOULD BE 

WITHDRAWN 

Claims 1-6, 8-14, 16, 20-23, 49-52, 74-97, and 121-126 are provisionally rejected under 
the judicially created doctrine of obviousness-type double patenting as being unpatentable over 
claims 1-31 of United States Patent Application Number 10/379,378 (published as U.S. 
2004/0072245) in view of Luo. Applicants respectfully submit that this double-patenting 
rejection should be withdrawn because the subject matter of United States Patent Application 
Number 10/379,378 and the claimed invention were not owned by the same person or subject 
to an obligation to the same person at the time the claimed invention was made. 

The inventors of the above-identified patent application are Claes Gustafsson, Sridhar 
Govindarajan, and Jeremy MinshuU, co-founders of DNA2.0 along with Dr. Jon Ness. 
DNA2.0 was founded in early 2003. Prior to founding DNA2.0, Claes Gustafsson, Sridhar 
Govindarajan, and Jeremy Minshull worked at Maxygen and were under an obligation to 
assign inventions to Maxygen. 

The claimed invention was conceived and made by Claes Gustafsson, Sridhar 
Govindarajan, and Jeremy Minshull after they left Maxygen and formed DNA2.0. Moreover, 
at the time the invention was conceived and made, Claes Gustafsson, Sridhar Govindarajan, 
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and Jeremy MinshuU were under an obligation to assign their rights to DNA2.0. This 
obligation was effected by an assignment by the inventors of all rights in the above-identified 
patent application to DNA2.0 that was recorded with the United States Patent and Trademark 
Office on January 31, 2006 at Reel 017907, Frame 0397. 

The inventors of United States Patent Application Number 10/379,378 are Claes 
Gustafsson, Sridhar Govindarajan, Robin A, Emig, Richard John Fox, Jeremy S. Minshull, 
Davis, S. Christopher, Anthony R. Cox, Phillip A. Patten, Linda. A Castle, Daniel L. Siehl, 
Rebecca Lynne Gorton, and Teddy Chen. The invention of United States Patent Application 
Number 10/379,378 was made while Claes Gustafsson, Sridhar Govindarajan, and Jeremy 
Minshull were employees of Maxygen and under an obligation to assign the invention of 
United States Patent Application Number 10/379,378 to Maxygen. The assignment by the 
inventors of all rights in United States Patent Application Number 10/379,378 to Maxygen was 
recorded on August 22, 2003 at Reel 014465, Frame 0084. None of the inventors of United 
States Patent Application Number 10/379,378 had an obligation to assign the subject matter of 
application 10/379,378 to DNA2.0. 

Claims 1-6, 8-14, 16, 20-23, 49-52, 74-97, and 121-126 are also provisionally rejected 
under the judicially created doctrine of obviousness-type double patenting as being 
unpatentable over claims 1-102 of United States Patent Application Number 10/874,802 
(published as U.S. 2005/0084907) in view of Luo. Applicants respectfully submit that this 
double-patenting rejection should be withdrawn because the subject matter of United States 
Patent Application Number 10/874,802 and the claimed invention were not owned by the same 
person or subject to an obligation to the same person at the time the claimed invention was 
made. 

The inventor of United States Patent Application Number 10/874,802 is John Richard 
Fox, who has no obligation, and was never under any obligation,to assign rights in the 
10/874,802 application to DNA2.0. John Richard Fox assigned all of his rights in United 
States Patent Application Number 10/874,802 to Maxygen in an assignment that was recorded 
on November 23, 2004 at Reel 015389, Frame 0152. 

For the above-identified reasons Applicants respectfully submit that the 
double-patenting rejection should be withdrawn. 
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CONCLUSION 

Applicants respectfully request entry of the foregoing amendments and remarks into 
the file of the above-identified application. If any fees are due in connection with this 
submission, please charge the required fee to Jones Day Deposit Account No. 50-3013. 



Respectfully submitted, 

Date: July 20, 2009 / Brett Lovejoy / 42,813 

Brett Lovejoy (R^g- No.) 

JONES DAY 

222 East 4 Street 

New York, New York 10017-6702 

Telephone: (415) 875-5744 
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APPENDIX A 

Copy of Liao et al.^ 2007, "Engineering proteinase K using machine learning and synthetic 

genes", BMC Biotechnology 7, 16 
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Abstract 

Background: Altering a protein's function by changing its sequence allows natural proteins to be 
converted into useful molecular tools. Current protein engineering methods are limited by a lack 
of high throughput physical or computational tests that can accurately predict protein activity under 
conditions relevant to its final application. Here we describe a new synthetic biology approach to 
protein engineering that avoids these limitations by combining high throughput gene synthesis with 
machine learning-based design algorithms. 

Results: We selected 24 amino acid substitutions to make in proteinase K from alignments of 
homologous sequences. We then designed and synthesized 59 specific proteinase K variants 
containing different combinations of the selected substitutions. The 59 variants were tested for 
their ability to hydrolyze a tetrapeptide substrate after the enzyme was first heated to 68°C for 5 
minutes. Sequence and activity data was analyzed using machine learning algorithms. This analysis 
was used to design a new set of variants predicted to have increased activity over the training set, 
that were then synthesized and tested. By performing two cycles of machine learning analysis and 
variant design we obtained 20-fold improved proteinase K variants while only testing a total of 95 
variant enzymes. 

Conclusion: The number of protein variants that must be tested to obtain significant functional 
improvements determines the type of tests that can be performed. Protein engineers wishing to 
modify the property of a protein to shrink tumours or catalyze chemical reactions under industrial 
conditions have until now been forced to accept high throughput surrogate screens to measure 
protein properties that they hope will correlate with the functionalities that they intend to modify. 
By reducing the number of variants that must be tested to fewer than 100, machine learning 
algorithms make it possible to use more complex and expensive tests so that only protein 
properties that are directly relevant to the desired application need to be measured. Protein design 
algorithms that only require the testing of a small number of variants represent a significant step 
towards a generic, resource-optimized protein engineering process. 
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Background 

Protein properties that are relevant to real-world applica- 
tions are often difficult to manipulate using either of the 
current protein engineering paradigms [1-3]: structure- 
based protein design [4,5] or directed evolution [6-8]. 
Both methods have shortcomings and advantages that 
have been discussed and compared elsewhere [1-3]. Chief 
amongst the limitations of both methods is the require- 
ment for high throughput computational or physical tests 
to evaluate protein variants for suitability to a specific 
application. A common problem with both approaches is 
that frequently there are no high throughput tests for real 
applications. For example, there are no high throughput 
tests for measuring how well a protease will remove grass 
stains from jeans, how quickly an antibody will shrink a 
tumour, or how immunogenic a potential vaccine antigen 
will be. As a consequence, protein engineers are fre- 
quently forced to compromise. Thus a structure-based 
approach in which the effects of large numbers of amino 
acid changes on the active site are calculated may require 
the protein engineer to consider only the affinity of an 
enzyme for its substrate and product while ignoring the 
effects that temperature and solvent conditions may have 
on the enzyme. Similarly an empirical library based 
approach in which large numbers of randomly produced 
viral antigen variants are tested for activity may allow the 
protein engineer to measure their binding to antibodies 
already known to be neutralizing, but would prohibit 
direct measurement of the production of such antibodies 
in animals exposed to the antigens. 

Many non-biotechnological engineering endeavours pose 
similar challenges to those found in protein engineering: 
a large number of independent variables and cost-prohi- 
bitions against exhaustive search. Such diverse tasks as 
fuel formulation, clinical trial design and chemical proc- 
ess optimization are solved using experimental designs to 
combine variables in specific ways, and regression analy- 
sis techniques to dissect out the contribution of each var- 
iable to the outcome [9]. The common goal in all these 
areas of optimization is to keep the total number of activ- 
ity measurements small enough to allow complex func- 
tional tests that are directly relevant to the final 
application. 

Multivariate data analysis has been used to optimize small 
molecules and peptides for nearly a quarter of a century 
[10-16]. In their paper describing chemical synthesis of a 
gene in 1984, Benner and colleagues suggested that sys- 
tematic variation of amino acids could provide an under- 
standing of the relationship between a protein's sequence 
and its function [17]. Until recently, however, synthesis of 
specifically designed individual genes has been suffi- 
ciently difficult to effectively preclude the construction of 
designed gene sets and meaningful testing of analytical 



predictions. Such efforts have thus been largely confined 
to the synthesis of very small numbers of discrete polynu- 
cleotide [18] or protein variants [19], or to the analysis of 
variants produced in a library [20-22]. 

A synthetic biology approach to protein engineering has 
been enabled by recent advances in gene synthesis tech- 
nology [23-26] that permit cost-effective synthesis of indi- 
vidually specified gene sequences instead of relying on 
creation of libraries of variant sequences [27,28]. The fea- 
sibility of producing tens or hundreds of protein variants 
in which all amino acid changes are precisely specified 
allows the sequences and activities of these variants to be 
analyzed using multivariate regression and machine 
learning techniques adapted from optimization tasks 
found in other engineering disciplines. 

We have tested this protein engineering approach by 
increasing the activity and heat stability of proteinase K. 
We selected 24 amino acid substitutions, then designed, 
synthesized and tested 59 genes containing combinations 
of these changes. We tested 8 different machine learning 
algorithms for their ability to identify the amino acid 
changes with a beneficial effect on proteinase K activity by 
using them to design new variants with improved combi- 
nations of substitutions. In 3 design cycles we synthesized 
genes encoding a total of 95 enzymes (~100 kb of syn- 
thetic genes), some of which had 20 times higher activity 
than the wild type protein. All 8 algorithms produced 
enzyme designs that were substantially improved over 
wild type. The results show that machine learning models 
of protein sequence and activity combined with efficient 
gene synthesis can be valuable tools in engineering pro- 
teins with improved properties. 

Results and Discussion 

i. Selection of proteinase K as a test system 

To test machine learning-based protein engineering we 
chose to optimize proteinase K-catalyzed hydrolysis of the 
tetrapeptide N-Succinyl-Ala-Ala-Pro-Leu p-nitroanilide 
following a heat- treatment of the enzyme. We selected 
this activity because it mimics a key characteristic of prac- 
tical protein optimization; target activities frequently 
result from a combination of protein properties, in this 
case expression and post-translational processing in a het- 
erologous host, catalytic activity and thermostability. 

The gene encoding proteinase K from Tritirachium album 
[29] was re-synthesized with an£. coli codon bias [30,31] 
and cloned into an arabinose-inducible E. coli expression 

vector. The nucleotide and amino acid sequences of this 
initial ("wild-type") proteinase K sequence are shown in 
Additional file 1 . 
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2. Engineering proteinase K: design methods 

2. 1 Overview of the method 

The protein engineering method described here involves 
the following steps. 

i) Selection of amino acid substitutions to incorporate 
into a target protein. 

ii) Design of protein variants containing different combi- 
nations of those substitutions. 

iii) Synthesis of genes encoding the protein variants. 

iv) Expression of the protein variants. 



v) Measuring the activity of the protein variants. 

vi) Analysis of protein variant sequences and activities to 
assess the contribution of each amino acid substitution. 

vii) Design of a new set of variants using the information 
from vi) . 

viii) Iteration of steps iii) to vii). 

These steps are described in detail for the engineering of 
proteinase K in the Results sections noted in Figure 1 . 



Starting point: wt proteinase K (Results section 1) 

Choose substitutions (Results section 2.2) 

; 

Design Variant Set 1 (1-1 to 1-59) (Results section 2.3) 

; 

Test Variant Set 1 (Results section 2.4) 

i 

Machine learning from data for Variant Set 1 (Results section 2.5) 

; 

Design of Variant Set 2 (Results section 2.6) 
A: Optimal (2-1 to 2-6) B: Exploring (2-7 to 2-20) 

I 

Machine learning from data for Variant Sets 1 & 2 (Results section 2.7) 

; 

Design & Testing of Variant Set 3 (Results section 2.8) 

A: All positive substitution weights (3-1 to 3-5) 
B: Substitution weights > 1 SD above 0 (3-6 to 3-9) 
C: Substitutions considering interactions between residues (3-10 to 3-16 



Figure I 

Flowchart of protein engineering design and testing process. The figure shows an overview of the experimental flow 
described in this work. Details are provided for each step in the indicated section of Results and Discussion. 
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2.2 Selection of amino acid substitutions 

We planned on synthesizing a total of less than 100 vari- 
ants containing combinations of a limited set of amino 
acid substitutions. To define a search space that could be 
effectively explored by synthesizing such a small number 
of variants we chose to use twenty four amino acid substi- 
tutions within the ~370 amino acid proteinase K: less 
than 0.5% of the total number of single amino acid 
changes possible. 

To select the substitutions, a set of serine proteases with 
>30% amino acid identity to proteinase K were identified 
by using the BLAST algorithm to search Genbank. This 
search produced 3 groups of homologous sequences. 
Group A contained the wild type and 5 close homologs 
(>90% amino acid identity). Group B contained 42 more 
distant homologs (between 30% and 90% amino acid 
identity). Group C contained 1 1 homologs (>30% amino 
acid identity) that were either reported in the literature to 
be thermostable or were >90% identical to a known ther- 
mostable sequence. Genbank accession numbers are pro- 
vided in Additional file 1 . 

The homologs were aligned using clustalW[32], to iden- 
tify the amino acids in each homolog that corresponded 
with the amino acid found in wild type proteinase K at 
each position. To increase the probability that at least 
some of the substitutions would increase activity, we 
selected 24 substitutions based on several different criteria 
[33]: (i) the substitution was reported in the literature to 
increase the stability of the serine protease subtilisin 
(N95Q P97S, E138A, M145F, L299Q I310K) [34,35]; (ii) 
the amino acid occurred within the homolog set (S107D); 
(iii) the amino add occurred within >70% of homologs 
from thermophilic organisms and within other homologs 
(S123A, 1132V, L180I, R237N, S273T, G293A, K332R, 
S337N); (iv) the amino acid was found within the 
homolog set and the substitution from the wild type resi- 
due is favourable in the Dayhoff substitution matrix 
(V1671, A199S, V267I) [36]; (v) principal component 
analysis of amino acids responsible for clustering of 
homologs from thermophilic organisms (K208H, A236V) 
[37]; (vi) literature reports that the P5S substitution has a 
stabilizing effect in subtilisin [34,35] AND appearance of 
a P to S substitution in the closest proteinase K homolog 
(P265S, P355S); (vii) the change occurred in a close 
homolog that is also thermostable (Y151A) and (viii) a 
random mutation identified during synthesis of the wild 
type (Y194S). This information is summarized in Table 1 

We emphasize that the method used for choosing amino 
acid substitutions is independent of the subsequent 
machine learning analysis. Substitutions could be selected 
by any of the many available methods including analysis 
of protein structures [38] or comparison of homologous 



sequences [39]. A more detailed review of combined 
methods for substitution selection has been published 
elsewhere[40]. 

2.3 Design of variant set / 

In order to test the machine learning algorithms, we 
needed to obtain a set of variants with corresponding 
activity measurements. For the most accurate analysis 
each substitution should be approximately equally repre- 
sented, and should occur with as many different substitu- 
tions (ie in different sequence contexts) as possible. We 
encountered two somewhat related obstacles to creating 
such a variant set. Firstly, all of the substitutions were pre- 
viously untested, so we did not know how many would 
completely inactivate the enzyme. Secondly we did not 
know how tolerant proteinase K would be to changes, that 
is how many amino acids we could change in a single var- 
iant and retain activity. The initial set of 59 variants was 
therefore designed in several stages as we obtained infor- 
mation about these parameters. The sequences of all vari- 
ants synthesized are shown in Additional file 2. 

i) First a set of 24 variants was designed with combina- 
tions of substitutions selected randomly but with the con- 
straint that each of the 24 substitutions occurred 6 times 
and each variant contained 6 substitutions (1-2 to 1-25, 
design method B in Additional file 2). Of these 24 variants 
only one (1-13) was active after heat-treatment. To deter- 
mine whether the low survival rate was because the substi- 
tutions destroyed all proteinase K activity, or because the 
substitutions primarily affected the heat sensitivity, we 
also measured the activity of all 24 variants without heat- 
ing. Under these less stringent conditions we found three 
additional variants that were active (1-2, 1-6 and 1-12). 
Eighteen of the 24 substitutions were present in 1 or more 
of these 4 variants with detectable proteinase K activity 
and thus did not completely inactivate the enzyme. 

ii) To see which of the remaining 6 substitutions 
destroyed proteinase K activity, we synthesized variants 
containing the substitutions that had not occurred within 
an active variant (1-26 to 1-33, design method C in Addi- 
tional file 2). Variants containing N95C, P97S, E138A, 
A236V and L299C were completely inactive, so these sub- 
stitutions were eliminated from further designs. 

iii) To obtain a larger number of active variants for mode- 
ling using machine learning we designed 10 variants by 
arbitrarily combining 3 substitutions that had appeared 
previously in active variants (1-34 to 1-43, design 
method D in Additional file 2) and 6 variants by arbitrar- 
ily combining 5 substitutions that had appeared previ- 
ously in active variants (1-44 to 1-49, design method E in 
Additional file 2) . 
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Table I : Amino acid substitutions selected for modification of proteinase K. 



Substitution 


Effect 


Reason for Selection 


N95C 


Lethal 


Literature report: disulphide bond between 95C and 299C reported to stabilize subtilisin BPN' (S3C and Q206C 
in subtilisin)[34,35]. 


P97S 


Lethal 


Literature report: P to S reported to increase stability in subtilisin BPN' (P5S in subtilisin) [34,35]. 


SI07D 


Negative 


Homolog sequence alignment analysis: D present at this position in 2/42 Group B homologs. 


SI 23 A 


Positive 


Thermostable homolog sequence alignment analysis: residue found in 8/1 1 Group C homologs and 6/42 Group B 
homologs. 


II 32V 


Positive 


Thermostable homolog sequence alignment analysis: residue found in 10/11 Group C homologs, 1/6 Group A 
homologs and 1 3/42 Group B homologs. Also a favorable change according to Dayhoff substitution matrix[36]. 


EI38A 


Lethal 


Literature report: acidic residue to A reported to increase stabibility in subtilisin BPN' (D4IA in 
subtilisin)[34,35]. 


MI45F 


Negative 


Literature report: M to F reported to increase stabibility in subtilisin BPN' (M50F in subtilisin) [34,351. 


YI5IA 


Strong positive 


Thermostable homolog sequence alignment analysis: residue found in close thermostable homolog gi| 1 3 1084 and 
2/42 Group B homologs. 


VI 67! 


Negative 


Substitution matrix-derived change: favorable change according to Dayhoff substitution matrix [36]. Residue 
found in 1/6 Group A homologs and 27/42 Group B homologs. 


LI 801 


Positive 


Thermostable homolog sequence alignment analysis: residue found in 10/11 Group C homologs, 1/6 Group A 
homologs and 10/42 Group B homologs. Also a favorable change according to Dayhoff substitution matrix [36]. 


YI94S 


Negative 


Random mutation obtained during synthesis of wt proteinase K. 


AI99S 


Negative 


Substitution matrix-derived change: favorable change according to Dayhoff substitution matrix [36]. Residue 
found in 1/6 Group A homologs and 9/42 Group B homologs. 


K208H 


Positive 


PCA identification of amino acids responsible for clustering of thermophilic sequences gi|4092486; gi|56 160990; 
gi| 1 14081 within Group A and B homologs [37]. 


A236V 


Lethal 


PCA identification of ammo acids responsible for clustering of thermophilic sequences gi [4092486; gi|56 160990; 
gi| 1 14081 within Group A and B homologs [37]. 


R237N 


Negative 


Thermostable homolog sequence alignment analysis: residue found in 9/1 1 Group C homologs, 1/6 Group A 
homologs and 1/42 Group B homologs. 


P265S 


Negative 


Structural considerations: literature report: P5S reported to increase stability in subtilisin BPN' (P5S in subtilisin) 
[34,35]. 265S found at this position in proteinase K closest homolog (gi| 1 3 1084). 


V267I 


Positive 


Substitution matrix-derived change: favorable change according to Dayhoff substitution matrix [36]. Residue 
found in 1/6 Group A homologs and 1/41 Group B homologs. 


S273T 


Positive 


Thermostable homolog sequence alignment analysis: residue found in 1 I/I 1 Group C homologs, 1/6 Group A 
homologs and 29/41 Group B homologs. Also a favorable change according to Dayhoff substitution matrix [36]. 


G293A 


Strong positive 


Thermostable homolog sequence alignment analysis: residue found in 1 I/I 1 Group C homologs, 1/6 Group A 
homologs and 38/4 1 Group B homologs. 


L299C 


Lethal 


Disulphide bond between 95C and 299C reported to stabilize serine proteases [34,35]. 


I3I0K 


Negative 


Literature report: K substitution at this position reported to increase stabibility by adding hydrogen bonding in 
subtilisin BPN' (Y2I7K in subtilisin) [34,35]. 


K332R 


Positive 


Thermostable homolog sequence alignment analysis: residue found in 8/1 1 Group C homologs and 1/6 Group A 
homologs. Also a favorable change according to Dayhoff substitution matrix [36]. 


S337N 


Positive 


Thermostable homolog sequence alignment analysis: residue found in 8/1 1 Group C homologs, 1/6 Group A 
homologs and 2/41 Group B homologs. Also a favorable change according to Dayhoff substitution matrix [36]. 


P355S 


Negative 


Structural considerations: literature report: P5S reported to increase stability in subtilisin BPN' (P5S in subtilisin) 
[34,35]. 355S found at this position in proteinase K closest homolog (gi| 131084). 



Selection criteria and references are shown for 24 amino acid substitutions within proteinase K. Group A, wild type plus 5 closest homologs (>90% 
identity); Group B, 42 homologs (30-90% identity); Group C, I I thermostable homologs. The effect of each substitution is also shown. Lethal: no 
active variant contained this substitution. Negative: the substitution was not selected by any of the third round design methods. Positive: the 
substitution was selected by at least one third round design method and was present in at least one third round variant with activity > 3x wild type. 
Strong positive: the substitution was selected by all third round design methods and are present in the most active variants. 
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iv) Finally we performed a manual analysis of the activity 
data from the first 49 variants, combining substitutions 
that occurred frequently within active variants ( 1 -50 to 1 - 
59, design method F in Additional file 2). 

2.4 Testing variant set I 

To associate protein sequences with functions, we tested 
the ability of the proteinase K variants to hydrolyze N-Suc- 
cinyl-Ala-Ala-F^o-Leu p-nitroanilide. Proteinase K variants 
were expressed in E. coli and purified over a Ni-NTA col- 
umn. The purified proteins were heated to 68 °C for 5 
minutes, cooled and diluted into reaction buffer contain- 
ing substrate. The activities measured are shown in Figure 
2 and in Additional file 2, which also show the activities 
of variants designed in two subsequent design cycles. 

Only 19 of the 59 enzymes in variant set 1 had detectable 
activities after heat treatment [see Additional file 2]. As 
described in section 2.5, we wished to analyze this dataset 
using machine learning algorithms to calculate the values 
of 20 parameters. Using a dataset this sparse will cause 
inaccuracies for the machine learning algorithms. To 
increase the number of datapoints without increasing the 
number of sequences synthesized we also measured the 
activities of all of the enzymes in variant set 1 without the 
heating step. We reasoned that this would provide addi- 
tional information, differentiating combinations of sub- 
stitutions that eliminated enzyme activity entirely from 



those which were simply unable to confer thermostabil- 
ity. More than half (32 of 59) of the variants in set 1 were 
active without a heating step (Additional file 2). 

2.5 Choice of machine learning algorithms and analysis of variant set 
I 

We wished to learn which amino acid substitutions 
increased activity and which were detrimental by analyz- 
ing the sequences and activities of the proteinase K vari- 
ants. The initial dataset was rather sparse: only two 
variants in the initial set (1-40 and 1-50) had an activity 
exceeding that of the wild type, by 2.8-fold and 3.3-fold 
respectively, while 45 possessed less than 10% of the wild 
type activity. Despite the generally low activities of the 
first set of variants, there was a range of activities that we 
analyzed by machine learning. 

To do this we first eliminated five substitutions (N95C, 
P97S, E138A, A236V and L299C) because variants with 
any of these substitutions did not have any detectable 
activity (Additional file 2). We then considered the 
reduced set of 19 substitutions, representing each variant 
as a 19 dimensional bit vector x^, where x^ j is 1 if there is a 
substitution in the variant at position j. We used a bit vec- 
tor since only one possible amino acid substitution was 
used at each position. A test of a protein variant was 
encoded as a pair (x^, yj, where represents the variant 
and y^ the activity measured for this variant. 
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Figure 2 

Three cycles of proteinase K variant design and test- 
ing. Mean activity measurements of the 3 sets of proteinase 
K variants are shown. Set I (diamonds) is the initial set of 59 

variants. Set 2 (squares, 20 variants) was designed using the 
activities of Set I . Set 3 (triangles, 1 6 variants) was designed 
based on sets I and 2. Activities towards N-Succinyl-Ala-Ala- 
Pro-Leu p-nitroanilide were measured at 37°C following a 5 
minutes heat treatment of the enzyme at 68°C. Activities are 
expressed relative to the mean activity of 2 replicates of the 
wild-type proteinase K. 



We selected 8 different machine learning algorithms to 
analyze the data. We used 8 different algorithms because 
we had no way of knowing which, if any, would be suita- 
ble for analyzing protein sequences and activities. The 
algorithms differ in two main ways. First in the way in 
which they calculate the differences between the meas- 
ured activity and the predicted activity (the "loss"), for 
example whether they use the square of the differences 
between measured and predicted activities (square loss), 
or whether they place more weight on differences between 
measured and predicted activities for the more active var- 
iants (matching loss). Second, the algorithms use differ- 
ent regularization functions, which determine for 
example whether preferred solutions use many small 
weights (2-norm) or fewer large weights (1-norm). The 
algorithms used were: ridge regression (RR) [41]; least 
absolute shrinkage and selection operator (Lasso) [42]); 
partial least square regression (PLSR) [43]; support vector 
machine regression (SVMR) [44]; linear programming 
support vector machine regression (LPSVMR) [45]; linear 
programming boosting regression (LPBoostR) [46]; 
matching loss regression (MR) [47,48]; one-norm regular- 
ization matching-loss regression (ORMR) [47,48]. See 
Additional file 1 for detailed descriptions of the algo- 
rithms. 
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Each algorithm was used to build linear models of the 
sequence and activity by calculating a 20-dimensional 
weight vector w, where the activity of a variant Xj is esti- 
mated as y i = (Xj = i..i9WjXi j) + W20. The weight Wj is asso- 
ciated with the j-th substitution, providing a measure of 
the effect of the j-th substitution on proteinase K activity. 
The last weight W20 is an additive shift. The machine learn- 
ing algorithms were used to select values for Wj that 
resulted in the best correlation between the activities that 
had been measured for each variant and the activities pre- 
dicted by the weight vectors w. To do this we created 1 000 
subsamples of the training set (the set of all (x^, yj pairs 

used for a cycle of machine learning) by leaving out 5 ran- 
domly chosen variant sequences for each such subsample. 
For all 8 algorithms we calculated the mean value and the 
standard deviation of each substitution weight Wj over the 

1000 subsamples of the training set. 
2.6 Design of variant set 2 

One objective in designing a second variant set was to see 
whether variants based on the results of machine learning 

analysis had improved activity relative to the training set. 
We also wished to obtain additional data so that we could 
perform a second round of machine learning-based vari- 
ant design, should the first round prove successful. Vari- 
ant set 2 was therefore designed in 3 parts. 

Initially we used each machine learning algorithm to 
select the sequence that it predicted would have the high- 
est activity using the heated activity data from the first set 
(variants 2-1 to 2-6, design method G in Additional file 
2). The effect of any substitution may depend upon the 
other substitutions with which it occurs (see Section 3.7). 
The fewer times that a substitution has been seen, the less 
accurately its average effect is known. We reasoned that a 
substitution should be seen at least three times to estimate 
a meaningful average effect (if the effect in one context is 
positive, and in another context is negative, a third context 
will provide a "tiebreaker"). We therefore excluded substi- 
tutions that had occurred in active variants fewer than 3 
times. To further reduce the chances of incorporating an 
apparently positive substitution that actually had a nega- 
tive effect we only included those substitutions whose 
weights exceeded a threshold. We first normalized all our 
activities against the wild type, resulting in activity 1 for 
the wild type. We then chose the threshold as 0.04 = 1/25, 
where 25 is the original number of weights (24 substitu- 
tions plus a bias term). Note that the final number of sub- 
stitutions somewhat smaller (19). This led to the 
exclusion of M145F, S123A, E132A and V2671 from vari- 
ants 2-1 to 2-6. In designing variant set 3, we improved 
our design method, using the standard deviation for each 



substitution weight instead of an arbitrary threshold (see 
Section 2.8). 

We designed a further 14 variants in set 2 to more thor- 
oughly explore the search space close to already tested var- 
iants and thus to provide sufficient data for a further cycle 
of machine learning. Six of these (2-7 to 2-12, design 
method H in Additional file 2) were designed using each 
machine learning algorithm in turn to select the variant 
with the largest predicted activity based on the mean 
weight for each substitution. To ensure that we designed 
sequences that were different from the first six, we only 
allowed variants between 3 and 5 amino acid changes 
from any tested variant of set 1 or any variant already cho- 
sen for inclusion in set 2. The lower bound of distance 3 
assured that new and significantly different variants were 
chosen, and the upper bound of distance 5 limited the risk 
of encountering non-viable combinations. 

The last 8 variants of set 2 (2-13 to 2-20, design method 
1 in Additional file 2) were designed in the same way as 2- 
7 to 2-12, except that instead of using the activities after 
heating for the machine learning, we used the activities 
before heating. The reason for this was that only 19 of the 
59 variants had detectable activities after heating, but 32 
had detectable activities before heating. The unheated 
measurements thus provided a better dataset for machine 
learning and we reasoned that they would increase the 
diversity of designs of active proteinase K variants. This is 
discussed in more detail in section 3.6. The 0.04 threshold 
was not subtracted from the weights for designs H and I 
(our aim was to design a set of active but different vari- 
ants) and the four substitutions that were excluded from 
2-1 through 2-6 (M145F, S123A, E132A and V267I) were 
included in 2-7 through 2-20. 

2. 7 Testing and machine learning analysis of variant set 2 
The first set of machine learning-based designs signifi- 
cantly outperformed those based on a manual "expert" 
analysis in set 1 . Thirteen of the twenty variants in set 2 

were more active than wild-type proteinase K, with 8 more 
active than the most active variant from set 1 (Figure 2 and 
Additional file 2]). Encouraged by this result we per- 
formed a second cycle of machine learning. 

The sequence and activity data from the first and second 
set of variants was combined and analyzed as before using 
each of 8 different machine learning algorithms to build 
linear models of the sequence and activity as described in 
section 2.5. The mean weights and standard deviations 
calculated by each algorithm are shown in Table 2, and 
shown graphically for one algorithm (MR) in Figtire 3 A. 

Because we now had more sequence and activity data, we 
also performed a rudimentary test of regression models 
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Table 2: Vector weights calculated for amino acid substitutions. 





RR 




Lasso 






PLSR 






SVMR 




Substitution 


M 


a 


M 


a 


M 




a 


M 




a 


SI07D 


-0.03 


0.13 


0.00 


0.02 


-0.70 




0.26 


-0.16 




0.13 


SI23A 


-1.00 


0.13 


-0.41 


0.35 


-1.42 




0.23 


-0.93 




0.14 


II 32V 


0.04 


0.44 


0.04 


0.55 


0.32 




0.76 


-0.34 




0.29 


MI45F 


-1.46 


0.19 


-2.27 


0.49 


-1.98 




0.32 


-1.58 




0.20 


Yl 5 lA 


1.18 


0.23 


0.9 1 


0.23 


1.66 




0.37 


0.9! 




0.1 5 


VI 671 


-0.97 


0.13 


-1.09 


0.15 


-I.IO 




0.17 


-0.79 




0.14 


1 lAni 




n 1 ^ 


-0.05 


0.10 






\f, 1 7 








YI94S 


0.27 


0.20 


0.00 


0.0! 


0.94 




0.73 


0.0! 




0.14 


AI99S 


-1.16 


0.39 


-1.09 


0.46 


-2.66 




0.98 


-0.86 




0.2! 


K208H 


0.28 


0.15 


0.07 


0.12 


0.52 




0.18 


0.36 




0.17 


R237N 


-0.93 


0.09 


-0.91 


0.13 


-1.21 




0.12 


-0.86 




0.15 


V267I 


-0.48 


0.1 1 


-0.32 


0.14 


-0.68 




0.13 


-0.16 




0.12 


S273T 


0.12 


0.14 


0.0! 


0.06 


0.28 




0.19 


-0.05 




0.17 


G293A 


1.95 


0.13 


2.24 


0.14 


2.10 




0.17 


1.70 




0.13 


K332R 


0.07 


0.13 


-0.01 


0.05 


0.02 




0.14 


0.09 




0.15 


S337N 


-0.02 


0.14 


0.03 


0.09 


-0.20 




0.15 


0.03 




0.14 


P355S 


-1.08 


0.12 


-1.20 


0.15 


-1.25 




0.13 


-I.IO 




0.15 




LPSVMR 




LPBoostR 




MR 






ORMR 




Substitution 


M 


a 


M 


a 


M 




a 


M 




a 


SI07D 


-0.0! 


0.20 


-0.02 


0.2! 


-0.35 




0.24 


-0.35 




0.24 


SI 23 A 


-0.41 


0.43 


-0.40 


0.41 


0.52 




0.93 


0.52 




0.93 


II 32V 


0.22 


0.69 


0.18 


0.56 


2.61 




0.91 


2.61 




0.91 


MI45F 


-2.39 


0.53 


-2.39 


0.53 


-5.33 




1.05 


-5.33 




1.05 


Y 1 5 1 A 


0.82 


0.24 


0.82 


0.25 


0.64 




^ ^ A 

0.24 


^ y A 

0.64 




0.24 


VI 671 


-0.99 


0.24 


-0.98 


0.23 


-1.63 




0.24 


-1.63 




0.24 


LI 801 


-0.20 


0.16 


-0.19 


0.16 


0.60 




0.23 


0.60 




0.23 


YI94S 


-0.02 


0.08 


0.21 


0.38 


4.59 




2.10 


4.59 




2.10 


AI99S 


-0.49 


0.33 


-0.73 


0.54 


-4.92 




2.03 


-4.92 




2.03 


K208H 


0.16 


0.18 


0.14 


0.18 


0.01 




0.13 


0.01 




0.13 


R237N 


-0.96 


0.29 


-0.96 


0.29 


-1.59 




0.57 


-1.59 




0.57 


V267I 


-0.41 


0.23 


-0.41 


0.23 


-1.33 




0.14 


-1.33 




0.14 


S273T 


0.19 


0.33 


0.15 


0.28 


0.96 




0.58 


0.96 




0.58 


G293A 


2.18 


0.25 


2.20 


0.25 


3.20 




0.14 


3.20 




0.14 


K332R 


0.12 


0.19 


0.14 


0.2! 


-0.33 




0.13 


-0.33 




0.13 


S337N 


0.27 


0.28 


0.26 


0.28 


0.34 




0.59 


0.34 




0.59 


P355S 


-1.34 


0.35 


-1.35 


0.34 


-1.95 




0.57 


-1.95 




0.57 



Mean (M) and standard deviation (a) values are shown for the 1 9 substitutions for which weights were calculated using machine learning. The values 
were calculated fronn 1000 subsamples of the variants with measurable activity from sets I and 2, where 5 variant sequences were randomly 
omitted from each subsample. 



that consider epistatic interactions between the selected 
substitutions. We did this by asking whether models con- 
taining epistatic interactions would result in a better fit 
between the observed and predicted activities of the vari- 
ants in the training set. 

Ideally we would like to know for each pair of positions 
(A, B) which of the 4 combinations of amino acids present 
at A and B maximize the activity. For 19 substitutions 



there are a total of 171 total possible pairs to consider (: A 
with B, A with C, A with D,...B with C, B with D, etc). Each 
pair consists of 4 possible states: both wild-type, both sub- 
stituted and 2 possibilities in which only 1 is substituted. 
Thus to perform one separate test for each possible com- 
bination in every pair would require 684 variants. To test 
each of these combinations in at least 3 different sequence 
contexts could require >2,000 variants, which would be 
quite impractical. Since computational resources are 
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Figure 3 

Substitution weigiit mean and standard deviation values produced by the MR algorithm. We created 1000 sub- 
samples of the training set (the sequences and non-zero activities of variants from sets I and 2) by leaving out 5 randomly 
selected variants from each subsample. A: The MR (matching loss) algorithm was used to calculate substitution weights for 
each subsample. The mean values from the 1000 subsamples are indicated by horizontal notches. Error bars represent one 
standard deviation of the 1000 calculated substitution weights. Substitutions are indicated below the graph with the number of 
occurrences in the training set in parentheses. Each substitution is described by a single weight. Variant 3—4 was designed to 
include all substitutions with positive mean weight that occur at least 3 times in the training set (red and blue circles). Note 
that substitution YI94S (green circle) was not selected since it occurred less than 3 times in the training set. Variant 3—9 
included all substitutions that occurred at least 3 times and whose mean weight was at least one standard deviation above zero 
(red circles only). Substitution weights calculated from the entire dataset instead of the mean of 1000 subsamples are shown as 
purple circles. B: The MR algorithm was used to calculate substitution weights as in A, except that models were tested by 
expanding each pair in turn into 4 terms and selecting the pair that most improved the model. In this example each substitution 
is described by a single weight except for the 3 pairs (1 32,208), (337,355), (267,293) which are modeled by 4 weights each. Re 
d circles indicate the substitutions selected to design variant 3—14. Note that substitution combination 1 1 32V 208K was not 
selected since it occurred less than 3 times in the training set. 
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cheap we instead used the 1000 subsamples of the train- 
ing sets to "virtually" test which pairs of substitutions led 
to better predictions by our algorithms. 

To do this we expanded one amino acid pair at a time into 
its four possible combinations and optimized the new 
weight vector. Thus, for each of the 171 possible position 
pairs (A, B), we built a model from each of the 1000 sub- 
samples using one weight for each position except for the 
(A, B) pair for which we used 4 weights: one for each pos- 
sible combination of the substitutions at position A and 
B. We computed the loss of the linear models on predict- 
ing activities of the 5 variants held out from each subsam- 
ple (the loss quantifies the differences between the 
predicted and measured activities) and averaged the loss 
over the 1000 subsamples. If one or more pairs of substi- 
tutions improved the model (ie reduced the mean loss) 
we fixed the pair that produced models with lowest mean 
loss and then repeated the process. Each time we picked 
the pair of positions that produced the largest reduction in 
the average loss. We stopped expanding amino acid pairs 
when no further reduction of the mean loss occurred. 

Examples of weights calculated by considering amino acid 
pairs are shown in Figure 3B. A comparison of Figure 3A 
with 3B shows that the expansion of 3 amino acid pairs 
into 4 combinations produced a model that also modified 
the weights of the single substitutions. Thus SI 23 A goes 
from being less than 1 standard deviation above zero to 
more than 1 standard deviation above zero. One substitu- 
tion (Y194S) goes from being very positive to negative 
(green circles in Figure 3). This substitution was only rep- 
resented twice in active variants in the training set, and 
both of these variants had very low activities (variants 1- 
12 and 1-35; Additional file 2). The consequences of 
these weights for variant design are discussed in sections 
3.2 and 3.3. 

2.8 Design and testing of variant set 3 

In designing variant set 3 we aimed to obtain further 
increases in proteinase K activity. We also wanted to test 
whether new variant designs were improved either by 
accoimting for the general context dependence of a substi- 
tution, or by considering epistatic interactions. Variant set 
3 was therefore designed in 3 parts to answer these 3 ques- 
tions. 

Our first two designs used only linear models. We selected 
one sequence for each algorithm by combining substitu- 
tions whose weights were calculated by that algorithm to 
be greater than zero (variants 3-1 to 3-5, design method J 
in Additional file 2). We selected a second sequence for 
each algorithm by combining substitutions whose 
weights were calculated by that algorithm to be at least 1 
standard deviation greater than zero (variants 3-6 to 3-9, 



design method K in Additional file 2). Values for the mean 
and standard deviations for substitution weights calcu- 
lated by each method are shown in Table 2. 

The third design used models that considered amino acid 
pairs. We selected one sequence for each algorithm by 
combining substitutions or substitution pairs whose 
weights were calculated by that algorithm to be at least 1 
standard deviation greater than zero (variants 3-10 to 3- 
16, design method L in Additional file 2). When more 
than one pair had a positive substitution weight, we 
selected the pair with the highest value when the standard 
deviation was subtracted from the mean. Thus in Figure 
3B we chose 337S, 355P over S337N, 355P and S337N, 
P355S although the weights for all three combinations 
were more than 1 standard deviation above zero. As for 
designs from the linear models, we only included a com- 
bination for a pair if that combination was present in the 
training set at least 3 times. Thus in Figure 3B we rejected 
1 1 32V, 20 8 K because it was present only twice, but instead 
chose 132IK, 208H which had a slightly higher value than 
1321, 208K. 

For every machine learning algorithm, the design that 
incorporated substitutions only when the mean substitu- 
tion weights were at least 1 standard deviation above zero, 
outperformed the design that incorporated substitutions 
when the mean substitution weights were simply greater 
than zero. There was no clear pattern when epistatic mod- 
els were used: the data is shown in Figure 4 and discussed 
in more depth in section 3.3. 

3. Analysis of the design methods 

3. 1 Functional contributions of amino add substitutions 

Ten of the initial set of 24 substitutions that we selected 

had a beneficial effect on proteinase K activity, a success 
rate of 40%. The substitutions were selected from a total 
of more than 7,000 possible (371 positions x 19 alterna- 
tives at each position), by using alignments of homolo- 
gous sequences without the use of any structural 
information. 

Table 1 shows the effect of each of substitution next to the 
method used to select it. Two of the 24 substitutions 
selected (Y151A and G293A) were selected as positive by 
all 8 algorithms in the second analyses (section 2.7). 
These substitutions were incorporated into every variant 
in round 3 and are present in the most active variants. 
Both of these substitutions were chosen from alignments 
of thermostable homologs of proteinase K. In all, eight of 
the ten positive substitutions (S123A, I132V, Y151A, 
LI 801, S273T, G293A, K332R and S337N) were selected 
based on the presence of the new amino acid in an align- 
ment of thermostable homologs. By contrast, four of the 
five substitutions that were not found in any active variant 



Page 10 of 19 

(page number not for citation purposes) 



BMC Biotechnology 2007, 7:16 



http://www.biomedcentral.eom/1472-6750/7/16 



40 



35 



30 



2 25 
> 

I 20 



■> 15 



u 
< 



10 



0 




oo 



II 



rooo COCO 
I I I I 

o 

RR 



II 



i 



I 



ill 



ro CO CO CO 
I I I I 

OJcn 00 00 
Lasso 



"'I 



I 

I I 



1 - 



1 



i i 
I I 1 
III 

III 
III 



M CO CO CO 
I I I I 

PLSR 



K> CO coco 
I I I I 

Ji.COO)-*' 

CO 



tsjco coco 

I I I I 

OOK>00-»> 




SVMR LPSVMR 



rococo CO CO 
I I I I I 

CO ro 00 
LPBoostR 



MR 



T 



ii 



1^ 



4 



■■1 



rocococo isjcococo 



ORMR 



Figure 4 

Activities of variants designed using substitution weights. Activities towards N-Succinyl-Ala-Ala-Pro-Leu p-nitroanilide 
were measured at 37°C following a 5 minute heat treatment of the enzyme at 68°C. Activities are expressed relative to the 
mean activity of duplicates of wild-type proteinase K. Error bars represent one standard deviation of the activity measure- 
ments. Variants are grouped according to the machine learning algorithm used to calculate substitution weights (indicated 
below each group), and are compared with the best variants from the initial design set (variants 1—40 and 1—50 black bars, on 
the left). The first design (yellow bars, design method G in Additional file 2) of each group belongs to set 2. We included a sub- 
stitution in the design if it occurred at least three times in the training set and its mean weight was at least one standard devia- 
tion above zero. All remaining designs in each group belong to set 3. The second in each group (green bars, design method J in 
Additional file 2) includes substitutions occurring at least three times and whose mean weights were merely positive (eg Figure 
3A, red and blue circles). The third in each group (red bars, design method K in Additional file 2) contained all substitutions 
occurring at least three times and whose mean weight was at least one standard deviation above zero (eg Figure 3A, red cir- 
cles). Note that this third design in each group is always better than the second. The last variant(s) in each group (blue bars, 
design method L in Additional file 2) were designed by modeling interdependent substitutions (eg Figure 3B, red circles). 



were designed based on literature reports of their stabiliz- 
ing effects in subtilisin (N95C, P97S, E138A, and 
L299C) [34,35]. We were thus most successful in choosing 
beneficial substitutions by selecting changes that occur in 
homologous natural proteins. 

The positions of all substitutions used are shown mapped 
onto the structure of proteinase K [see Additional files 3, 
4 and 5 ] . We could see no obvious pattern distinguishing 
the locations of beneficial from detrimental substitutions, 
nor were we able to identify simple structural reasons for 
the effect of the substitutions. 



For future extensions of this method, if a target activity is 
not achieved with the initial set of substitutions, addi- 
tional substitutions can be chosen and incorporated into 
a new set of variants along with the best substitutions that 
have already been tested. Results from previous experi- 
ments can improve a second cycle of substitution selec- 
tion. For example, to obtain further improvements in 
proteinase K activity by incorporating new substitutions, 
we would pick more substitutions that appear in align- 
ments of thermostable homologs and avoid those 
reported to confer stability on subtilisin. More data from 
other systems will be required to determine whether the 
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best method for picking substitutions varies depending 
on the protein target or the desired application. In either 
case, using a variety of methods for the initial selection, 
then analyzing the functional contributions of substitu- 
tions selected by different methods is likely to provide a 
good starting point for other protein engineering projects. 

3.2 Representation of substitutions in the training data set 

In our initial set of variant designs (section 2.3) v^e aimed 
to have each amino acid substitution represented more 
than 6 times. Because so many of our random combina- 
tions of substitutions were inactive this resulted in a train- 
ing set where different substitutions were represented very 
unevenly. 

There are two major consequences of underrepresentation 
of a substitution for machine learning analysis. One can 
be seen in Figure 3 A, where the MR algorithm assigned 
Y194S a high weight even though both variants in the 
training set have very low activity. Because there are only 
2 active variants encoding Y194S, the machine learning 
algorithms tend to assign weights to the substitution that 
improve the fit of other substitutions to the model, but do 
not really reflect the contribution of the underrepresented 
substitution. The more the substitution is represented the 
less likely this is to occur because there are more 
datapoints that the weight has to be consistent with. Table 

2 also shows that this phenomenon is dependent on the 
machine learning algorithm used. Two of the three algo- 
rithms that use one-norm regularization (Lasso and LPS- 
VMR) and use fewer larger weights to fit the data give very 
low scores to Y194S. 

A second consideration that arises when a substitution is 
underrepresented is that it is difficult to assess effects of 
context upon the contribution of a substitution. Some- 
times a substitution may be beneficial with one set of 
other substitutions, but deleterious with a different set. 
The fewer times a substitution has been tested, the less 
likely such interactions are to be detected. 

In this study we required that a substitution occur at least 

3 times for us to use it in a subsequent design. For future 
designs of variant sets it will be important to ensure that 
each substitution is adequately represented in the training 
data set. 

3.3 Accounting for interactions between amino acid substitutions 
The extent to which one substitution affects the contribu- 
tion of another substitution to protein function is difficult 

to predict. For proteins that have evolved by the sequen- 
tial accumulation of point mutations, most of those 
mutations must work well in many different contexts. 
This is because each new mutation will produce a new 
sequence context, so an enzyme whose amino acids were 



predominantly very context dependent would be largely 
immutable. This view is supported by a study in which all 
amino acid differences in 15 natural subtilisins were 
recombined by DNA shuffling [49]. Almost all possible 
pairwise combinations of amino acid differences were 
found in functional subtilisin enzymes produced by this 
recombination, suggesting that amino acid covariation 
seen in the original 15 orthologs resulted from common 
ancestral derivation rather than functional constraints. 
Selecting amino acid substitutions that occur in natural 
homologs should therefore provide a useful bias towards 
variations that are tolerated in many contexts. 

Different subsamples of the training set produced differ- 
ent values for the weights of each substitution. This differ- 
ence probably arises from noise in the data as well as from 
possible context effects. To accommodate this variation in 
our designs we used the standard deviation of each weight 
as a measure of its variability of effect. We compared the 
activities of third cycle variants designed by combining all 
substitutions whose mean weights were positive (substi- 
tution weight example shown in Figure 3A blue and red 
circles, activities shown in Figure 4 variants 3-1 to 3-5, 
green bars), with those designed by combining only sub- 
stitutions whose mean weight was more than 1 standard 
deviation above zero (substitution weight example shown 
in Figure 3 A red circles only, activities shown in Figure 4 
variants 3-6 to 3-9, red bars). For every machine learning 
algorithm, the variant that contained only substitutions 
whose mean weights were at least 1 standard deviation 
above zero was more active than the corresponding vari- 
ant that included all substitutions with positive weights. 
The standard deviation of a substitution weight thus 
appears to provide a useful evaluation of the likely contri- 
bution of that substitution to protein function. 

As described in sections 2.7 and 2.8 we also tested designs 
based on regression models that considered epistatic 
amino acid interactions. Activities of variants designed in 
this way are also shown in Figure 4 (variants 3-10 
through 3-16, blue bars, substitution weight example in 
Figure 3B). Only the algorithms PLSR and LPBoostR pro- 
duced more active designs based on modeling amino acid 
interactions than the corresponding designs produced 
when all substitutions were modeled independently. One 
of these, LPBoostR, found the most active of all the 95 
sequences we tested (variants 3-11 and 3-12). However 
we note that different machine learning algorithms 
selected different amino acid pairs for expansion. It is 
therefore unclear to us whether these pairings are actually 
related to epistatic interactions between the amino acids 
themselves, or result from differences in the machine 
learning methods' ways of minimizing discrepancies 
between measured and predicted activities in a small and 
unevenly distributed dataset. 
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Understanding interactions between specific pairs of 
amino acid substitutions is unlikely to limit the protein 
engineering method described here. To test every combi- 
nation of all pairs of amino acid substitutions would rap- 
idly become prohibitively expensive: for 1 9 substitutions 
it would require more than 2000 variants (see section 
2.7). However it is relatively simple to instead select sub- 
stitutions that work well in many contexts and to reject 
those that work well in some contexts but poorly in oth- 
ers. This can be done by using the standard deviation of 
the substitution weight over many subsamples of the 
training set, keeping only those whose mean weights are 
more than one standard deviation above zero. This will 
also ensure that if additional substitutions are incorpo- 
rated subsequently, those substitutions already accepted 
and fixed are likely to be generally tolerant to further 
change. 

3.4 Comparison with other design methods 

As a control to determine whether the same degree of 
activity improvement could be achieved by simpler 
means, we analyzed the activity distribution for 4 sets of 
variants. The first set, taken from the first 49 variants syn- 
thesized, comprised 20 variants which contained arbitrar- 
ily selected combinations of the 19 substitutions 
considered in the machine learning designs (1-2, 1-6, 1- 
12, 1-13 and 1-26 through 1-49; Figure 5, white bars). 
The second set comprised 1 0 variants that were designed 
by our "expert" analysis of the sequence and activity data 
from the first 49 variants (1-50 through 1-59; Figure 5, 
light shading). The third and fourth sets comprised 20 and 
16 variants designed using machine learning analysis (2-1 
through 3-16; Figure 5, dark shading and black fill respec- 
tively). The activities of the randomly designed variants 
are predominantly extremely low: 80% are less than 3% of 
wild type activity and just one is more active than wild 
type. The activities for the variants designed by manual 
data analysis are a little more evenly distributed, but still 
only one is more active than wild type. By comparison 
70% of the variants designed in the first cycle of machine 
learning were more active than wild type and all of the 
variants designed in the second cycle of machine learning 
were at least 3 -fold more active than wild type. While it is 
not possible for us to compare machine learning with all 
available protein engineering methods, this control shows 
that machine learning identified highly functional combi- 
nations of substitutions that could not be readily 
obtained either by random selection or by manual analy- 
sis. 

The machine learning designs, which resulted in enzyme 
activity increases of up to 20-fold, differed from classical 
experimental designs (see for example [9]) because of the 
epistatic effect of some amino acid changes. Amino acid 
changes or pairs of changes that completely eliminate pro- 



tein activity will mask any positive or negative contribu- 
tions made by other substitutions that occur with them. 
Experimental designs such as Taguchi matrices [50,51] 
minimize the number of experiments by combining many 
variables at once. A Taguchi orthogonal design for testing 
24 substitutions would have produced 48 variants con- 
taining different combinations of 12 substitutions and 1 
containing all 24. Our initial design incorporated only 6 
changes into each of 24 variants. Although this was a less 
complete testing of the combinations, because 5 substitu- 
tions abolished enzyme activity, we did obtain 4 variants 
with detectable activity. By synthesizing an additional 8 
variants we were able to identify the 19 functional amino 
acids in our set of substitutions. By contrast the Taguchi 
design would almost certainly have produced only inac- 
tive variants and thus no information. 

Machine learning has also been used in the related 
domain of drug design to search large libraries of small 
molecules for compounds with maximal activity towards 
a biological target. The activity levels of some compounds 
are known and "active learning" methods [52] are used to 
select the next batch of compounds to be tested [53]. 
There are a number of significant differences between 
these searches and those in a protein engineering setting. 
For example small molecules are described by feature vec- 
tors of sizes between 10 and 10^ [54], while each protein 
variant in this study is described by 24 binary features. 
Another difference is that in drug discovery large datasets 
are available for testing various machine learning meth- 
ods [53,55] while no such data exists for proteins. Small 
molecule datasets are also generally quite large, typically 
with 103 to 104 compounds: Fang et al estimate that train- 
ing sets of 10,000 member compounds are required to 
build a predictive model but in this study we tested a total 
of less than 100 variant proteins. In part this is possible 
because our protein descriptors are so much simpler and 
the relatedness of any pair of variants is unambiguous. 
This allowed us to identify improved proteins and then to 
focus on highly related proteins. 

3.5 Multiple protein properties are modified simultaneously 
The activity of proteinase K that we targeted depends on 
activity towards the substrate and heat-stability of the pro- 
tein. We were interested in knowing whether we had mod- 
ified one or both of these properties. A second motivation 
was that we were unable to measure the concentration or 
proteinase K: it auto digested so efficiently that we were 
unable even to visualize it on a gel. Since the half-life of 
the protein at 68°C should be essentially independent of 
the protein concentration, changes in half-life reflect 
changes in the protein itself and not possible influences of 
expression levels. 
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Figure 5 

Machine learning design compared with random choices and "expert'* designs. Distribution of activities of 4 sets of 
variants designed using different methods are shown. Set A (white bars, variants 1—2, 1—6, 1-12, 1-13 and 1—34 to 1—49, total 
of 20 variants) contain arbitrarily selected combinations of 3, 5 or 6 substitutions. Set B (light shading, variants 1-50 to 1-59, 
total of 10 variants) were designed by manual analysis of the sequence and activity data from variants I through 49. Set C (dark 
shading, variants 2- 1 to 2-20, total of 20 variants) were designed using machine learning algorithms based on the data from var- 
iants I through 59. Set D (black fill, variants 3-1 to 3-16, total of 16 variants) were designed using machine learning algorithms 
based on the data from variants I- 1 through 1—59 and 2-1 through 2—20. 



We measured the activity towards the substrate and the 
half-life at 68 °C of 13 of the best variants. Figure 6A 
shows the activity of wild type proteinase K following dif- 
ferent exposures to 68 ° C, Figure 6B shows one of the third 
cycle variants (3-9) after the same heating times. Figure 7 
shows the activity without heating (white bars) and the 
half-life (shaded bars) for wild type proteinase K and 13 
third cycle variants, as well as the substitutions in each 
variant. With combinations drawn only from a small set 
of 24 selected substitutions, a significant diversity of func- 
tional combinations provided the desired outcome, from 



variants in which the primary effect was increasing overall 
activity (3-3) to those in which both activity and half-life 
were improved (3-11). We expect that this pattern would 
continue if we attempted to deconvolute further. For 
example, the increase in activity without heat treatment is 
probably a combination of increased specific activity and 
increased protein expression levels, with varying contribu- 
tions from each activity in each variant. Most importantly 
for the approach described here, several properties were 
altered simultaneously to improve an activity that 
depended on multiple properties. 
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Figure 6 

Increases in proteinase K activity witii and witiiout iieating. Proteinase K variants were tested from triplicate inde- 
pendent cultures for activity after heating at 68°C for different times: unheated (circles), 2.5 minutes (squares), 5 minutes 
(crosses), 7.5 minutes (triangles), 10 minutes (diamonds) and 15 minutes (open squares). A: absorbance at 405 nm of substrate 
incubated with wild type proteinase K, B: absorbance at 405 nm of substrate incubated with variant 3—9. 



3.6 Different machine learning predictions from different data sets 
Different parts of variant set 2 were designed using differ- 
ent data sets (see section 2.6). Variants 2-1 to 2-12 were 
designed using the activity of the first set of protein vari- 
ants after heat-treatment, while variants 2-13 to 2-20 
used the activity without heat treatment. Variants 2-13 to 
2-20 contained new combinations of the substitutions 
incorporated into 2-1 to 2-12, as well as 2 that were not 
included using only the heated data (S273T and VI 671; 
Additional file 2). Five variants designed using the 
unheated data were more active than wild type, approxi- 
mately the same proportion as those that were designed 
using the activities after heating (Additional file 2). 

Although some variants designed using the unheated 
activities were active after heating, the activities without 
heating were in no way intended as a surrogate for the 
activity after heating. We used the unheated activities only 
to obtain a larger set of sequences, but did not measure 
the activities of variant sets 2 or 3 without heating. There 
are clearly combinations of substitutions that produce 
increased activity over wild type without heat treatment, 
but lower activity than wild type after heating (eg 1-13, 1- 
30, 1-37, 1-43 and 1-47). If we had performed several 
cycles of engineering using only the unheated activities, 
we would therefore expect only a subset to be active after 
heating. 



3.7 Differences in predictions of the machine learning algorithms 

The different machine learning algorithms did not con- 
verge to the same sequence design. The eight algorithms 
produced 4 different variant designs (3-6 through 3-9) 
using the same training data set. These differences in turn 
arose from differences in calculated mean and standard 
deviations for the substitution weights (shown in Table 
2), which themselves resulted from differences in the way 
in which the algorithms model the data. The activities of 
the variants designed using different machine learning 
algorithms were very comparable (see Figure 4), and we 
were unable to really distinguish between them by their 
performances. The comparable performance of all the 
machine learning algorithms we used is probably due to 
the fact that we have too few example proteins. We expect 
that with more examples, clear differences between the 
algorithms could appear. Testing this hypothesis will 
require analysis of additional datasets. 

It is unclear from the activity data whether there is a single 
optimal sequence, although there are clearly many 
improved sequences. For example the two most ther- 
mostable variants, 3-9 and 3-11, share 3 substitutions 
but differ at 5 positions (see Figure 7). The substitutions 
1 132V and LI 801 appear in 3-9 but not in 3-1 1. Addition 
of either of these 2 substitutions to 3-1 1 leads to variants 
with lower thermostability than either 3-9 or 3-1 1 (vari- 
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Figure 7 

Changes in activity and half-life in designed protein variants. Activity (unlieated) and lialf life were calculated for 1 3 
protein variants and wild type proteinase K. The activity without heating was calculated from the initial slopes of the A4Q5 meas- 
urements without heating (white bars), examples shown in Figure 6. The half-life at 68°C (shaded bars) was calculated using the 
initial slopes after different heating times and fitting to an exponential curve. Error bars represent one standard deviation of the 
experimental measurements. The wild-type values are shown on the left. The substitutions of each variant are given in the col- 
umn below the variant name. Only 10 of the 19 positions are shown. In the remaining 9 positions, all variants contained amino 
acids from the wild-type sequence. 



ants 3-2 and 3-12). Thus the effect of these two substitu- 
tions appears to be influenced by other changes in the 
protein. This context dependence of substitutions suggests 
a limitation for approaches such as site saturation muta- 
genesis [56], in which all changes are considered inde- 
pendently and then combined based only on their 
behaviour in the wild- type context. 

Conclusion 

We have developed a new synthetic biology approach to 
protein engineering in which amino acid substitutions are 
selected, incorporated in different but defined combina- 
tions into a small number of variant enzymes which are 
individually synthesized and tested functionally. Machine 
learning algorithms are then used to assign values to the 
functional contribution of each substitution, which serves 
as the basis for a further set of variant designs. The process 



is repeated until a target activity is achieved. We have 
tested the approach using proteinase K as a target protein. 

Substitutions that improved the activity of proteinase K 
were primarily identified using alignments of naturally 
occurring homologous proteins, structural information 
was not used. The exponential accumulation of natural 
DNA sequences [57,58] could facilitate the use of phylo- 
genetic information for substitution selection in many 
other systems, helping to remove the prerequisite of 
obtaining high resolution crystal structures before initiat- 
ing a protein engineering project. 

We tested 8 different machine learning algorithms and 
foimd them all able to produce predictive models describ- 
ing the contributions of individual amino acid substitu- 
tions to the activity of proteinase K. We also found that it 
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was unnecessary to consider all possible amino acid inter- 
actions to obtain substantial improvements in protein 
activity. However, it was advantageous to use the machine 
learning models to identify (and eliminate) substitutions 
whose effect appeared to vary significantly depending on 
the sequence context. 

By designing, synthesizing and testing a total of only 95 
specific proteinase K variants, of which 36 were designed 
using machine learning algorithms, we obtained a 20-fold 
increase in protein activity. Application of the strategy 
described here to other systems should allow proteins to 
be optimized using functional measurements for small 
numbers of protein variants. This would obviate the need 
for library construction and high throughput screening. 
Instead, variants could be directly tested in complex low- 
throughput assays that accurately reflect the combination 
of properties desired for the final application of the opti- 
mized protein. 

Methods 
Gene synthesis 

A proteinase K-encoding gene was designed using Protein- 
2-DNA software [30] to select a codon distribution mim- 
icking natural highly expressed E. coli proteins [31]. The 
gene was assembled from chemically synthesized oligode- 
oxyribonucleo tides (purchased from Operon) as 
described previously [59,60], and cloned into pBAD/glll 
(Invitrogen) . Variant genes were synthesized by replacing 
oligonucleotides encoding the amino acids with those 
encoding the desired amino acid substitutions. Variants 
were cloned between the Ncol and Sail sites of the 
sequence [see Additional file 1 ] . 

Proteinase K expression and purification 

Proteinase K was expressed in the E. coli periplasm and 
purified on Ni-NTA. Briefly, a single colony of E. coli car- 
r5^ng a proteinase K variant in pBAD/glll was picked from 
a carbenicillin plate and grown overnight. Forty microlit- 
ers of culture was then diluted into 4 ml pre-warmed LB- 
carbenicillin, grown for 3 to 4 hours (to an A^qq of 0.2- 
0.3), arabinose was added to 0.2% w/v and the cells were 
grown overnight. All growth was performed at 3 0 ° C in LB 
containing 50 |ig/ml carbenicillin. Cells were pelleted in a 
microfuge, thoroughly resusp ended in 200 jul of 20% wv 
sucrose, 200 mM NaH2P04, pH 7.4, 1 mM EDTA and 30 
U/jLil lysozyme (freshly added from a 30,000 U/jli1 stock: 
Ready- Lyse, Epicentre) and incubated at 25 ° C for 5 min- 
utes. Cells were subjected to osmotic shock by addition of 
200 jLil ice-cold water, mixed by inversion and incubated 
for 5 minutes on ice to release periplasmic protein. Cells 
were pelleted in a microcentrifuge. The supernatant was 
removed, adjusted to 300 mM NaCl, 10 mM imidazole, 
67 mM NaH2P04, pH 7.4 and loaded onto an Ni-NTA col- 
umn (Qiagen) pre- equilibrated with 50 mM NaH2P04, 



pH 7.4, 300 mM NaCl, 10 mM imidazole. The column 
was washed twice with 600 jal of 50 mM NaH2P04, pH 
7.4, 300 mM NaCl, 20 mM imidazole and then eluted 
twice with 100 ^1 of 50 mMTris-Cl pH 7.4, 300 mM NaCl, 
250 mM imidazole. The two eluates for each culture were 
pooled. 

Proteinase K activity measurements 

Proteinase K Ni-NTA eluates were heat-treated in a PCR 
machine at 68°C for 5 minutes. Proteinase K activity was 
measured by addition of 10 jliI Ni-NTA eluate to 90 jLtl 
reaction buffer. Final reaction conditions were 50 mM 
Tris-Cl pH 7.4, 180 mM NaCl, 5 mM CaCl^, 25 mM imi- 
dazole, 500 jLiM N-Succinyl-Ala- Ala-Pro-Leu p-nitroani- 
lide substrate (Sigma S-85 11). The reaction was incubated 
at 37 °C, and followed by measuring absorbance at 405 
nm. Activities were calculated from the initial rate of reac- 
tion and comparison with a standard curve constructed 
using 4-nitroanilide [61-63]. Because proteinase K self- 
digests we were unable to accurately determine protein 
concentration. Activities are therefore expressed either rel- 
ative to the wild-type enzyme activity, or as pmol sub- 
strate hydrolyzed per second per ml of initial culture from 
which the proteinase K variant was purified (pmol/s/ml). 
The activities measured for each sequence in Set 1 and Set 
2 are shown in Additional file 2. 

Proteinase K half-life calculations 

Proteinase K Ni-NTA eluates were heat-treated in a PCR 
machine at 68 °C for 0, 2.5, 5, 7.5, 10 and 15 minutes. 
Proteinase K activity was measured at 37°C, and followed 
by measuring absorbance at 405 nm. Initial reaction rates 
were plotted against heating time and fitted to an expo- 
nential curve. 

Machine learning algorithms 

Machine learning algorithms and methods used for vari- 
ant design are detailed in Additional file 1. Variant set 1 
(active variants in Additional file 2) was the training data 
set used for the design of set 2. Multiple measurements of 

the same variant were treated as separate pairs. All algo- 
rithms were implemented using commercially available 
software (Matlab from Mathworks). 
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